Improved Worst-Case Regret Bounds for Randomized Least-Squares Value Iteration

Authors

Abstract

This paper studies regret minimization with randomized value functions in reinforcement learning. In tabular finite-horizon Markov Decision Processes, we introduce a clipped variant of a classical Thompson Sampling (TS)-like algorithm, randomized least-squares value iteration (RLSVI). Our $\tilde{\mathrm{O}}(H^2S\sqrt{AT})$ high-probability worst-case regret bound improves on the previously sharpest worst-case bounds for RLSVI and matches the existing state-of-the-art worst-case bounds for TS-based algorithms.
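To make the algorithm concrete, the following minimal Python sketch runs tabular finite-horizon RLSVI with a clipping step: each episode perturbs the empirical Bellman backups with Gaussian noise and clips the resulting Q estimates to the feasible value range before acting greedily. The environment interface (`reset`/`step`), the noise scale, and the clipping range are illustrative assumptions, not the exact quantities from the paper's analysis.

```python
import numpy as np

def rlsvi_clipped(env, S, A, H, episodes, noise_scale=1.0, seed=0):
    """Illustrative tabular RLSVI with clipping (not the paper's exact constants).

    env is assumed to expose reset() -> state and step(action) -> (next_state, reward, done),
    with integer states in [0, S), integer actions in [0, A), and rewards in [0, 1].
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros((H, S, A))           # visit counts per step h
    rew_sum = np.zeros((H, S, A))          # summed rewards per (h, s, a)
    trans = np.zeros((H, S, A, S))         # transition counts per (h, s, a, s')
    total_reward = 0.0

    for _ in range(episodes):
        # --- Randomized value iteration on the empirical model ---
        Q = np.zeros((H, S, A))
        V = np.zeros((H + 1, S))           # V[H] = 0: terminal values
        for h in reversed(range(H)):
            n = np.maximum(counts[h], 1.0)
            r_hat = rew_sum[h] / n
            p_hat = trans[h] / n[..., None]
            # Gaussian perturbation of the Bellman backup plays the role of sampling.
            noise = rng.normal(0.0, noise_scale * (H - h) / np.sqrt(n))
            Q[h] = r_hat + p_hat @ V[h + 1] + noise
            # Clipping step: keep estimates inside the feasible range [0, H - h].
            Q[h] = np.clip(Q[h], 0.0, H - h)
            V[h] = Q[h].max(axis=1)

        # --- Act greedily w.r.t. the sampled Q and log the data ---
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            s_next, r, done = env.step(a)
            counts[h, s, a] += 1
            rew_sum[h, s, a] += r
            trans[h, s, a, s_next] += 1
            total_reward += r
            s = s_next
            if done:
                break
    return total_reward
```

The clip to [0, H - h] simply reflects that, with per-step rewards in [0, 1], no policy can collect more than H - h reward from step h onward.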


Similar resources

Optimistic posterior sampling for reinforcement learning: worst-case regret bounds

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of Õ(D√(SAT)) for any communicating MDP with S states, A actions and diameter D, when T ≥ SA. Here, reg...
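For a rough idea of how such a posterior sampling method operates, here is a hedged Python sketch of an epoch-based PSRL loop: transitions get a Dirichlet posterior, mean rewards a Gaussian posterior, and each epoch follows the greedy policy of one sampled MDP until a visit count doubles. The finite-horizon value-iteration planner is a simplified stand-in for average-reward optimal planning in a communicating MDP, and the priors, epoch rule, and environment interface are assumptions made for illustration.

```python
import numpy as np

def psrl(env, S, A, T, plan_horizon=100, seed=0):
    """Illustrative epoch-based posterior sampling (PSRL) sketch.

    env is assumed to expose reset() -> state and step(action) -> (next_state, reward),
    with integer states in [0, S) and integer actions in [0, A).
    """
    rng = np.random.default_rng(seed)
    alpha = np.ones((S, A, S))         # Dirichlet prior over transitions
    rew_sum = np.zeros((S, A))         # sufficient statistics for mean rewards
    visits = np.zeros((S, A))
    s = env.reset()
    t = 0
    policy = np.zeros(S, dtype=int)

    while t < T:
        # --- Sample one MDP (transitions and mean rewards) from the posterior ---
        p = np.zeros((S, A, S))
        for si in range(S):
            for ai in range(A):
                p[si, ai] = rng.dirichlet(alpha[si, ai])
        n = np.maximum(visits, 1.0)
        r = rng.normal(rew_sum / n, 1.0 / np.sqrt(n))

        # --- Plan in the sampled MDP; finite-horizon value iteration stands in
        #     for average-reward optimal planning in the communicating MDP ---
        V = np.zeros(S)
        for _ in range(plan_horizon):
            Q = r + p @ V
            V = Q.max(axis=1)
        policy = Q.argmax(axis=1)

        # --- Follow the sampled policy until some (s, a) visit count doubles ---
        epoch_visits = visits.copy()
        while t < T:
            a = int(policy[s])
            s_next, reward = env.step(a)
            alpha[s, a, s_next] += 1
            rew_sum[s, a] += reward
            visits[s, a] += 1
            t += 1
            stop = visits[s, a] >= 2 * max(epoch_visits[s, a], 1)
            s = s_next
            if stop:
                break
    return policy
```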


Posterior sampling for reinforcement learning: worst-case regret bounds

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of Õ(D√(SAT)) for any communicating MDP with S states, A actions and diameter D, when T ≥ SA. Here, reg...


Least-Squares Policy Iteration

We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach is motivated by the least-squares temporal-difference learning algorithm (LSTD) for prediction problems, which is known for its efficient use of sample experiences compared to pure temporal-difference a...
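As a sketch of the policy-iteration loop described here (under assumed tabular one-hot features and a fixed batch of transitions, neither of which is taken from the paper), LSTD-Q fits a linear Q-function for the current policy from the batch, and the policy is then improved greedily until it stabilizes.

```python
import numpy as np

def lspi(samples, S, A, gamma=0.95, iters=20, reg=1e-6):
    """Illustrative LSPI sketch with tabular one-hot features phi(s, a).

    samples: list of (s, a, r, s_next, done) tuples with integer states/actions.
    Returns a greedy policy as an integer array of length S.
    """
    d = S * A

    def phi(s, a):
        x = np.zeros(d)
        x[s * A + a] = 1.0
        return x

    policy = np.zeros(S, dtype=int)            # start from an arbitrary policy
    for _ in range(iters):
        # --- LSTD-Q: solve A w = b for the current policy's Q-function ---
        A_mat = reg * np.eye(d)                # small ridge term for invertibility
        b = np.zeros(d)
        for s, a, r, s_next, done in samples:
            f = phi(s, a)
            f_next = np.zeros(d) if done else phi(s_next, policy[s_next])
            A_mat += np.outer(f, f - gamma * f_next)
            b += r * f
        w = np.linalg.solve(A_mat, b)

        # --- Greedy policy improvement w.r.t. the fitted Q ---
        q = w.reshape(S, A)                    # one-hot features make Q a table
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy): # stop once the policy is stable
            break
        policy = new_policy
    return policy
```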


Computing Exploration Policies via Closed-form Least-Squares Value Iteration

Optimal adaptive exploration involves sequentially selecting observations that minimize the uncertainty of state estimates. Due to the problem complexity, researchers settle for greedy adaptive strategies that are sub-optimal. In contrast, we model the problem as a belief-state Markov Decision Process and show how a non-greedy exploration policy can be computed using least-squares value iterati...


Incremental Least Squares Policy Iteration for POMDPs

We present a new algorithm, incremental least squares policy iteration (ILSPI), for finding the infinite-horizon policy for partially observable Markov decision processes (POMDPs). The ILSPI algorithm computes a basis representation of the value function by minimizing the Bellman residual and it performs policy improvement in reachable belief states. A number of optimal basis functions are dete...



Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i8.16813